Analysis of the Protein-Coding Content of the 
Sequence of Human Cytomegalovirus Strain AD 169 

M. S. Chee, A. T. Bankier, S. Beck, R. Bohni, C. M. Brown, R. Cerny,
T. Horsnell, C. A. Hutchison III, T. Kouzarides, J. A. Martignetti,
E. Preddie, S. C. Satchwell, P. Tomlinson, K. M. Weston, and
B. G. Barrell



1 Introduction 126
2 Sequence Analysis 126
3 Prediction of Reading Frames 135
3.1 Criteria for Selection 135
3.2 Codon Bias 136
3.3 HCMV Map 136
4 Identification of Homologs 141
5 IE Genes 143
5.1 MIE Early Gene Region 143
5.2 HCMV US3 IE Gene 144
5.3 UL37 IE Gene 145
6 Early and Late Genes 145
6.1 Major Early Transcripts 145
6.2 Enzymes of Nucleotide and DNA Metabolism 147
6.2.1 Nucleotide Metabolism 147
6.2.2 DNA Replication 148
6.2.3 DNA Repair 148
6.2.4 Deoxyribonuclease 149
6.3 Phosphotransferase 149
6.4 Early Phosphoprotein Genes 149
6.5 Late DNA-Binding Proteins 150
6.6 Capsid Proteins 150
6.7 Structural Phosphoprotein Genes 151
6.8 Surface Glycoproteins 154
6.8.1 Glycoproteins B and H 155
6.8.2 HLA Homolog 155
6.8.3 T-Cell Receptor Homology 156
7 Gene Families 157
7.1 RL11 Family 157
7.2 The US6 Family 158
7.3 The US22 Family 158
7.4 The G-Protein Coupled Receptor (GCR) Family 159
8 Relationship to α- and γ-Herpesvirus Genomes 160
9 Perspectives 162
References 163



MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK



Current Topics in Microbiology and Immunology, Vol. 154
© Springer-Verlag Berlin Heidelberg



126 M. S. Chee et al. 

1 Introduction 



Large-scale sequence analysis of the AD 169 strain of human cytomegalovirus
(HCMV) began in this laboratory in 1984, when very little was known about the
sequence or location of genetic information in the viral genome. At that time
sequence analysis was confined to the major immediate-early gene (Stenberg et al.
1984), a region of the Colburn strain that contained CA tracts (Jeang and Hayward
1983), the L-S junction region (Tamashiro et al. 1984), and what has been termed the
transforming region (Kouzarides et al. 1983). This chapter is being written in
March 1989, when the sequence is complete apart from the polishing of certain
areas, which is still in progress (manuscript in preparation). As far as we know
there are no major discrepancies in the data that might lead to the sequence
changing, although of course this cannot be ruled out. We present a preliminary
analysis of the HCMV genome and limit ourselves mainly to the potential
protein-coding content of over 200 reading frames.



2 Sequence Analysis 



The sequence has been determined by M13 shotgun cloning and chain termination
sequencing. In this random approach each base is sequenced many times on average
so that the consensus produced should be highly accurate. The sequencing strategy
involved applying this random procedure to each HindIII fragment of the viral
genome (Oram et al. 1982). However, the high G + C content caused severe
problems, as manifested in the many compressions encountered on the sequencing
gels. This entailed resequencing many clones, substituting dITP or 7-deazaGTP for
dGTP in the reactions to minimize the effect. All sequences have been determined on
both strands. Detailed accounts of the methods used are published elsewhere
(Bankier et al. 1987; Bankier and Barrell 1989). The sequences at the ends of the
genome which were not generated in the HindIII library were obtained from the
HindIII junction fragments C (equivalent to I and Q) and G (equivalent to K and Q),
which were sequenced in their entirety, and from a portion of the HindIII B (K and
H) junction fragment from the HindIII W/H end to the EcoRI site 21.2 kb
downstream (Weston and Barrell 1986) (Fig. 1). Sequences were also obtained
across all the HindIII sites. Double-stranded sequencing on appropriate
overlapping cosmid and plasmid clones (Fleckenstein et al. 1982) confirmed that the
sequence was contiguous except for an extra 393-bp fragment which was found
between HindIII T and E, and which we have named HindIII d. The final map in the
prototypical orientation of the viral genome with the HindIII fragments predicted
from the sequence is shown in Fig. 1. As the precise ends of the molecule are not
known, we have chosen to number the sequence from the start of the direct repeat
(DR1) found by Tamashiro et al. (1984). By analogy with the "a" sequence of other
herpesviruses, this is the closest feature to the end of the genome (Mocarski and
Roizman 1982; Tamashiro et al. 1984; Spaete and Mocarski 1985b). Our sequence is
numbered from base 2352 of Tamashiro et al. (1984) but reading backward on the
complementary strand. It contains a single copy of a DR1-flanked 578-bp sequence
at each end and at the junction of the internal repeats. The sequence we have
determined consists of 229 354 base pairs. The long unique region (UL) is 166 972 bp
and the surrounding repeats (IRL and TRL) are 11 247 bp each. The short unique
region (US) is 35 418 bp and is flanked by 2524-bp repeats (IRS and TRS). In the
sizes given above, IRL and IRS are considered as overlapping by one copy of the
DR1-flanked repeat unit. The long repeats are identical except for two base changes:
a C at position 5288 and a G at position 8293 are both substituted by As in the
equivalent IRL positions. The former change does not affect any predicted coding
sequences, while the latter affects TRL/IRL10 (Table 1). Two differences were also
found in the short repeats: in IRS, an A at position 189887 and a G at position
190332 are substituted by C and T respectively in TRS. The former difference is
silent while the latter changes a valine residue in HCMV-IRS1 to a leucine in
HCMV-TRS1.

[Fig. 1. Map of the HCMV AD 169 genome in the prototype orientation, showing the
HindIII fragments predicted from the sequence. Figure not reproduced.]

[Table 1. Predicted HCMV reading frames, with functions ascribed to the genes or
their homologs. Table not reproduced.]
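The region sizes quoted above can be checked against the stated total with a little arithmetic. The sketch below is only an illustration: the variable and function names are our own, and it assumes that the single DR1-flanked 578-bp unit shared by IRL and IRS is the only overlap to subtract.

```python
# Region sizes in base pairs, as quoted in the text.
UL = 166_972        # long unique region
RL = 11_247         # each long repeat (TRL and IRL)
US = 35_418         # short unique region
RS = 2_524          # each short repeat (IRS and TRS)
SHARED_UNIT = 578   # DR1-flanked unit counted in both IRL and IRS


def genome_length(ul, rl, us, rs, shared):
    """Sum TRL-UL-IRL-IRS-US-TRS, counting the IRL/IRS overlap only once."""
    return ul + 2 * rl + us + 2 * rs - shared


total = genome_length(UL, RL, US, RS, SHARED_UNIT)
print(total)  # 229354
```

The result agrees with the 229 354 bp given for the complete sequence, which suggests the region sizes and the overlap convention are mutually consistent.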



3 Prediction of Reading Frames 



Very little of the genome has been mapped in terms of its transcription or its 
expression. In order to analyze the protein-coding content of the sequence we need 
to define the criteria for the selection of the reading frames we think are most likely to 
be coding. A description of the procedures we have applied is given below. 





3.1 Criteria for Selection

Analysis of other herpesvirus genomes shows that in most regions the reading frame
that is coding is the longest and that such reading frames are arranged end to end on
either strand with very little noncoding sequence in between. Very few overlapping
genes have been found, although there are sometimes small overlaps at the
beginnings and ends of genes. Thus the strategy we have adopted has been to screen
the sequence for reading frames that are over a certain length and then to filter out
any smaller frames that overlap larger ones by a certain amount. The cutoffs that we
have chosen are a minimum length of 300 bp (i.e., a coding potential of 100 amino
acids) and a maximum allowable overlap of a larger reading frame of 60%. This
latter figure allows for the fact that a reading frame may be open upstream of the
actual initiation codon and that this may lie under the preceding gene. There are 778
reading frames over 300 bp, of which 581 are screened out on the grounds that they
are overlapped extensively by larger frames, leaving 197 candidate protein-coding
genes. The sequence is then examined for reading frames of less than 300 bp that may
lie in the gaps that are left. Likely frames are selected by experience using criteria
such as logical combinations of potential transcription signals with the reading
frame and any potential translational start; homology to other reading frames or
known genes; and the presence of protein structural or functional motifs in the
amino acid sequence. Codon bias can also be used as described below. The whole
procedure will not work where genes are spliced and the exons are small. In those
regions of the genome where the genes are highly spliced or in regions which are
noncoding, small background noncoding reading frames will have been included
which would otherwise have been screened out if larger coding reading frames were
present. We think that this is particularly true in and bordering the repeat sequences
and in certain regions of the HindIII D and E fragments. In a few cases we have
substituted a smaller frame for a larger overlapping frame where we have found
compelling reasons to choose the former.
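The two-step screen described above (a length cutoff followed by removal of frames extensively overlapped by larger ones) can be sketched as follows. This is a minimal illustration, not the procedure actually used: the `Frame` type, the function names, and the example coordinates are hypothetical, and we assume that overlap is measured as the fraction of the smaller frame covered by any longer frame, irrespective of strand.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # inclusive, end >= start

    @property
    def length(self) -> int:
        return self.end - self.start + 1


MIN_LENGTH_BP = 300   # cutoff quoted in the text (100 amino acids)
MAX_OVERLAP = 0.60    # maximum allowable overlap quoted in the text


def overlap_bp(a: Frame, b: Frame) -> int:
    """Number of positions shared by two frames (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def select_candidates(frames):
    """Keep frames of at least MIN_LENGTH_BP that no longer frame covers by more than MAX_OVERLAP."""
    long_enough = [f for f in frames if f.length >= MIN_LENGTH_BP]
    kept = []
    for f in long_enough:
        # Assumption: overlap is expressed relative to the smaller frame's length.
        covered = any(
            g.length > f.length and overlap_bp(f, g) / f.length > MAX_OVERLAP
            for g in long_enough
        )
        if not covered:
            kept.append(f)
    return kept


# Hypothetical frames, not taken from the HCMV sequence.
frames = [
    Frame("A", 1, 1200),      # 1200 bp, longest frame
    Frame("B", 1001, 1500),   # 500 bp, 40% covered by A: kept
    Frame("C", 201, 700),     # 500 bp, fully inside A: screened out
    Frame("D", 2000, 2200),   # 201 bp: below the length cutoff
]
print([f.name for f in select_candidates(frames)])  # ['A', 'B']
```

A real implementation would also need the manual steps the text describes, such as inspecting the gaps for frames shorter than 300 bp and occasionally preferring a smaller frame over a larger overlapping one.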






3.2 Codon Bias 

Patterns of codon usage that could conceivably be generated only through the
genetic code are, in the absence of any other criteria, the best indication that a
sequence is coding for protein. The high G + C content of HCMV (57.2%) leads to
an accumulation of G and C in the third, degenerate, position of the codons. This
is because in an average amino acid sequence the excess G and C cannot be
accommodated in the first and second positions without biasing the sequence to
amino acids encoded by GC-rich codons. Figure 2 shows a G + C plot across the
entire sequence. As can be seen, there is considerable variation in the G + C content
across the genome, particularly in the repeat areas, the regions bordering the
repeats, and the HindIII D fragment. Because of this variability we have not yet been
able to find a single formula that we could apply equally to all areas of the genome to
justify further our selection of reading frames on the basis of size and position.
However, codon bias does serve as a useful check in those areas with a high G + C
content.
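The third-position effect described above can be measured directly by tallying G + C separately at each codon position of a candidate frame. The function below is a simple sketch under our own assumptions: the name `gc_by_codon_position` and the example sequence are hypothetical, and the input is assumed to be already in frame.

```python
def gc_by_codon_position(orf: str):
    """Return the G + C fraction at codon positions 1, 2 and 3 of an in-frame sequence."""
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]     # every third base, starting at this codon position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


# Hypothetical seven-codon sequence, not taken from HCMV.
example = "ATGGCCGACCTGACGTTCTAA"
first, second, third = gc_by_codon_position(example)
print(round(first, 2), round(second, 2), round(third, 2))
```

In a G + C-rich genome a genuine coding frame would be expected to show a markedly higher value at the third position than at the first two, whereas the same tally in the wrong frame or in noncoding sequence should be flatter. As the text notes, no single threshold works across the whole genome.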



3.3 HCMV Map 



The preliminary map of 208 reading frames deduced from the sequence using the
criteria discussed above is shown in Fig. 3. Details are given in the figure legend of
individual frames that we have omitted from the original set of 197 (Sect. 3.1) and the
criteria for inclusion of replacement frames. Although some of the frames shown are
unlikely to be coding (for example, UL126, which overlaps the noncoding exon 1 of
the major immediate-early gene and part of the enhancer), we preferred to include all
frames meeting our minimal criteria unless a more plausible alternative candidate
could be identified.



[Fig. 2. G + C content plotted across the entire HCMV sequence. Figure not reproduced.]

4 Identification of Homologs 



The HCMV protein sequences were screened against the PIR (release 19.0; George
et al. 1986) and SWISS-PROT (release 8.0; Bairoch 1988) libraries using the FASTA
program of Pearson and Lipman (1988). Searches were also performed against a
herpesvirus protein library including HSV-1, VZV, and EBV sequences. In these
library comparisons alignments were examined when optimized FASTA scores of 90
or greater were obtained, although in some cases lower-scoring matches were also
scrutinized. Some of the HCMV sequences match numerous reading frames as a
result of compositional bias, which may be general throughout the sequence or
localized. For example, glycine-rich stretches occur in a number of reading frames,
including HCMV-UL44, 56, 102, 112, and TRS/IRS1. In most cases highly biased
matches have been excluded. Sometimes, however, these similarities are likely to
reflect functional similarities, if not homology. For example, HCMV-UL122, which
encodes an immediate-early transactivator, is similar to HSV-IE110, also an
immediate-early transactivator. The results of overall homology searches, motif
searches (Staden 1988), and comparisons of gene layout with EBV, VZV, and HSV-1
have been amalgamated in the compilation of human herpesvirus and cellular
homologs. Functions ascribed to HCMV genes or their homologs are noted in Table
1. Homologies detected to the sequenced herpesviruses are shown in Table 2. A






Fig. 3. A map of predicted open reading frames in HCMV strain AD169. Two hundred and eight
individual frames are recognized, some of which are known to be spliced. The reading frame map is drawn
in the prototype orientation below the HindIII restriction map. The diagram is scaled in kilobase pairs.
Open reading frames which overlap on the same strand are displaced in the figure. Frames are numbered
separately except for three genes for which splice sites have been precisely located (HCMV-UL36, UL37,
and UL123) (Kouzarides et al. 1988; Stenberg et al. 1984, 1985), and one gene for which the splice sites
are probably conserved with other herpesviruses (HCMV-UL89) (Costa et al. 1985). Genes which may be
spliced to upstream frames, but which are also capable of being initiated at a proximal ATG are
numbered separately (HCMV-UL36, UL38, UL122). Frames are designated TRL, IRL, UL, TRS, IRS, or
US according to the region of the genome in which their 5' ends are located, and each of these six sets is
numbered from 1. A frame which spans the DR1 repeats (Sect. 2) and hence is capable of crossing the
genomic termini has been designated J (junction) 1. Three manifestations of this frame which differ in
their 5' and 3' termini occur, and are shown as J1L, J1S, and J1I (where L, S, and I denote long, short, and
internal respectively; see also Table 1). The "a" sequence is shown as a thin vertical line located within the
repeats. The following frames have been included in place of longer overlapping frames; the names of the
latter (not shown) are given in brackets, together with reasons for the substitution; the orientations of the
substituted frames are indicated by the direction of numbering: 1, J1L and TRL1 (TRL1X, positions
291-1361; these frames occupy the region more completely, with minimal overlap. TRL1 has a proximal
TATA box and a Kozak consensus ATG). [NB. J1L completely overlaps a frame equivalent to HKRFX
(Weston and Barrell 1986) (not shown, positions 873-43)]; 2, UL38 (UL38X, positions 51 098-52 141;
third position G + C; see Sect. 5.3); 3, UL106 (UL106X, positions 155 043-155 465; third position G + C);
4, UL112 (UL112X, positions 161 638-160 466; third position G + C; mapping data; Wright et al. 1988);
5, UL123 (UL123X, positions 172 331-172 816; overlaps major immediate-early gene exons 2 and 3); 6, J1I
and IRL1 (IRL1X, positions 189 176-188 106; see 1 above). US25X (former name HHRF1, positions
215 051-215 518; Weston and Barrell 1986) had an excessive overlap with US25 and was omitted
without another frame being substituted in its place. The small frame UL111A (marked as A) was included
because it has a Kozak consensus ATG, a transcript has been identified in the region, and it is a conserved
feature of a transforming region in HCMVs Towne and AD169 (Razzaque et al. 1988; Jahan et al. 1989).
The frame is one amino acid shorter than the Towne sequence, having a relative 3-bp deletion, but the
predicted amino acid sequence is otherwise identical.






Table 2. Homologs of HCMV reading frames in the sequenced herpesviruses. Internal HCMV-related
sequences as well as EBV, VZV, and HSV-1 homologs are listed, together with FastA scores (Pearson
and Lipman 1988). HCMV homologous families containing three or more sequences are indicated only in
Table 1. We have found from experience that FastA scores above 100 are often significant, except when
sequences are highly biased in composition. Homologs which were not identified by library searches, but
which were inferred from their collinearity with other conserved frames, are scored as P (positionally
conserved). Listings scored as P? should be regarded as tentative at best. Listings with a question mark and
a FastA score show borderline similarity in the absence of supporting evidence and should be regarded as
speculative. In most cases the highest scores above 90 were listed. Compositionally biased matches were
excluded for the following frames: HCMV-TRL/IRL4, TRL/IRL13, UL32, UL44, and UL113.
Nomenclature for EBV, VZV, and HSV-1 frames is conventional (Baer et al. 1984; Davison and Scott
1986; McGeoch et al. 1988a); the EBV sequence designated as LP (leader protein) is translated from the
spliced EBNA2 mRNA (Wang et al. 1987)



Frame       HCMV         Score        EBV          Score  VZV        Score  HSV      Score

HCMVUL15    -            -            [illegible]  93     -          -      -        -
HCMVUL25    HCMVUL35     [illegible]  -            -      -          -      UL9?     87
HCMVUL35    HCMVUL25     [illegible]  -            -      -          -      -        -
HCMVUL45    -            -            [illegible]  151    VZV19      178    UL39     238
HCMVUL46    -            -            [illegible]  P      VZV20?     P      UL38?    P
HCMVUL47    HCMVUL86?    [illegible]  [illegible]  P      VZV21?     P      UL37?    P
HCMVUL48    -            -            [illegible]  143    VZV22      P      UL36     144
HCMVUL49    -            -            [illegible]  249    VZV23      P      UL35     P
HCMVUL50    -            -            [illegible]  P?     VZV24?     P      UL34?    P
HCMVUL51    -            -            [illegible]  P?     VZV25      97     UL33     106
HCMVUL52    -            -            [illegible]  138    VZV26      179    UL32     207
HCMVUL53    HCMVUL69?    95           [illegible]  263    VZV27      99     UL31     141
HCMVUL54    HCMVUL130?   90           BALF5        343    VZV28      326    UL30     423
HCMVUL55    -            -            BALF4        720    VZV31      1061   UL27     1052
HCMVUL56    HCMVUL112?   95           BALF3        321    VZV30      290    UL28     323
HCMVUL57    -            -            BALF2        352    VZV29      220    UL29     298
HCMVUL61    -            -            LP?          181    -          -      -        -
HCMVUL69    HCMVUL53?    95           BMLF1        P      VZV4       P      UL54     127
HCMVUL70    -            -            BSLF1        293    VZV6       302    UL52     405
HCMVUL71    -            -            BSRF1        92     VZV7?      P      UL51?    P
HCMVUL72    -            -            BLLF2        P      VZV8       P      UL50     88
HCMVUL73    -            -            [illegible]  134    -          -      -        -
HCMVUL75    HCMVUL25?    90           BXLF2        217    VZV37      P      UL22     P
HCMVUL76    -            -            BXRF1        219    VZV35      151    UL24     132
HCMVUL77    -            -            BVRF1        316    VZV34      278    UL25     291
HCMVUL80    -            -            BVRF2        347    VZV33      177    UL26     243
HCMVUL82    HCMVUL83     325          -            -      -          -      -        -
HCMVUL83    HCMVUL82     325          -            -      -          -      -        -
HCMVUL85    -            -            BDLF1        P      VZV41      114    UL18     138
HCMVUL86    HCMVUL47?    96           BcLF1        1876   VZV40      767    UL19     1225
HCMVUL87    -            -            BcRF1        542    VZV38?     P      UL21?    P
HCMVUL89    -            -            BD/BGRF1     1181   VZV42/45   1104   UL15     1206
HCMVUL92    -            -            BDLF4        213    -          -      -        -
HCMVUL93    -            -            BGLF1?       P      VZV43?     P      UL17?    P
HCMVUL94    -            -            BGLF2        241    VZV44      P      UL16     P
HCMVUL95    -            -            BGLF3        112    VZV46      P      UL14     P
HCMVUL97    -            -            BGLF4        157    VZV47      112    UL13     97
HCMVUL98    -            -            BGLF5        191    VZV48      78     UL12     140
HCMVUL99    -            -            BBLF1?       P      VZV49?     P      UL11?    P
HCMVUL100   -            -            BBRF3        417    VZV50      224    UL10     215
HCMVUL101   -            -            BBLF2?       P      VZV51?     P      UL9?     P
HCMVUL102   -            -            BBLF3?       P      VZV52?     P      UL8?     P
HCMVUL103   -            -            BBRF2        102    VZV53      91     UL7      121
HCMVUL104   -            -            BBRF1        357    VZV54      375    UL6      309
HCMVUL105   -            -            BBLF4        704    VZV55      642    UL5      598

[illegible], entry not recoverable from the scanned page; -, no entry




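Table 2 (continued)

Frame       HCMV         Score        EBV          Score  VZV        Score  HSV      Score

HCMVUL112   HCMVUL56?    95           -            -      -          -      -        -
HCMVUL114   -            -            BKRF3        545    VZV59      461    UL2      489
HCMVUL116   -            -            BDLF3?       128    -          -      -        -
HCMVUL122   -            -            -            -      -          -      IE110?   90
HCMVUS2     HCMVUS3      169          -            -      -          -      -        -
HCMVUS3     HCMVUS2      169          -            -      -          -      -        -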



survey of HCMV proteins including map assignments in the AD 169, Towne, and 
Davis strain genomes has been conducted previously by Landini and Michelson 
(1988). 
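The score conventions used for the homology listings (alignments examined at
optimized FastA scores of 90 or greater, scores above 100 often significant, and
P or P? for positionally conserved frames) can be summarized in a short Python
sketch. The function name and the returned labels are illustrative assumptions,
not terms used in the analysis, and the sketch ignores the caveat about
compositionally biased sequences.

```python
def classify_listing(score):
    """Interpret one score entry of the kind listed in Table 2.

    `score` is either an integer FastA score or the string "P" / "P?".
    The labels returned here are hypothetical shorthand for this sketch.
    """
    if score == "P":
        return "positionally conserved"
    if score == "P?":
        return "tentative positional match"
    if score > 100:
        # Often significant, unless the sequences are compositionally biased.
        return "often significant"
    if score >= 90:
        # Scores of 90 or greater were routinely examined.
        return "examined; borderline"
    return "below usual cutoff"
```

For example, `classify_listing(1061)` falls in the "often significant" class,
whereas `classify_listing(95)` is treated as borderline.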



5 IE Genes 



The activation of IE genes is the initial step in a viral program of gene expression. 
Northern hybridization studies have shown that transcription from the HCMV 
genome during the immediate early phase of productive infection is limited to 
several discrete loci, with the most active region located near one end of UL 
(DeMarchi 1981; Wathen and Stinski 1982; McDonough and Spector 1983; 
Jahn et al. 1984; Wilkinson et al. 1984). This major immediate-early (MIE) region 
has been studied in several CMV strains, and unlike the bulk of the CMV genome is
CpG suppressed (Honess et al. 1989). The MIE genes encode regulatory proteins,
the expression of which requires only cellular factors, although virion components
may also play a transactivating role (Spaete and Mocarski 1985a; Stinski and
Roehr 1985). More recently two other immediate-early loci have been sequenced
and characterized in AD 169 (Kouzarides et al. 1988; Weston 1988). 
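CpG suppression is commonly expressed as the ratio of observed to expected CpG
dinucleotides. The following Python sketch computes that ratio for a sequence; it
is a generic measure offered as an illustration, on the assumption that a ratio
well below 1 indicates suppression, and it is not necessarily the calculation
used by Honess et al. (1989). The function name is hypothetical.

```python
def cpg_obs_exp(seq):
    """Return the observed/expected CpG ratio for a DNA sequence.

    Expected CpG count is (number of C * number of G) / sequence length,
    so the ratio is (CpG count * length) / (C count * G count).
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio is undefined without both bases; report 0
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)
```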



5.1 MIE Gene Region 

The first sequence data for this region were reported for HCMV Towne (Stenberg
et al. 1984) and showed the four-exon arrangement of the major immediate-early
(IE1) gene. Sequence analysis of the corresponding AD 169 region revealed a similar
arrangement with minor differences. Only two changes were observed at the amino
acid level (Akrigg et al. 1985). The organization of the equivalent murine CMV
gene is grossly similar, but differs considerably at the sequence level (Keil et al.
1987). Analysis of the HCMV IE promoter region exposed a complex array of 21-,
19-, 18-, and 16-bp repeats upstream of the TATA and CAAT boxes (Thomsen et al.
1984; Akrigg et al. 1985). The upstream sequence demonstrates a potent enhancer
activity, detected by its ability to rescue enhancerless SV40 genomes (Boshart et al.
1985). Homology with the core enhancer sequence TGGAAAG/TGGTTTG was
noted in the 18-bp repeats and potential Sp1-binding sites were also found. The
enhancer binds cellular factors (Ghazal et al. 1987, 1988) and dissection has shown
that the 19-bp elements can mediate cAMP induction (Fickenscher et al. 1989;
Hunninghake et al. 1989). Similar enhancers were also found in murine and simian
CMVs (Dorsch-Hasler et al. 1985; Jeang et al. 1987). Nuclear factor I binding
sites are associated with the enhancer region in both human and simian CMVs
(Hennighausen and Fleckenstein 1986; Jeang et al. 1987).

Stinski et al. (1983) recognized two further IE regions beginning immediately
downstream of IE1. The IE2 region has more recently been called IE2a and a further
region recognized as IE2b (Hermiston et al. 1987; Stenberg et al. 1985). Under
immediate-early conditions, transcription of the IE2a region starts mainly from the
IE1 promoter and a set of alternatively spliced transcripts is produced. In the
predominant species the IE2a exon (HCMV-UL122 in AD169) is fused to the first
three exons of IE1. HCMV-UL122 encodes 494 amino acids following the splice
acceptor. This is in agreement with the size predicted of the IE2a exon reported for
the Towne strain by Pizzorno et al. (1988). A 1.7-kb unspliced mRNA can also
originate from a promoter proximal to the IE2a frame (which also contains a Kozak
consensus ATG; Kozak 1981). This transcript is more abundant at early and late
times postinfection (Stenberg et al. 1985). The product of the IE2a frame may be
involved in autoregulation (Pizzorno et al. 1988). A minor transcript extending into
the IE2b region has been diagrammed (Hermiston et al. 1987). We are unable to
correlate this with the AD 169 sequence using the available information. However, a
potential splice donor occurs before the UL122 termination codon, and a polyA
signal at position 167 503 is consistent with the predicted end point of the Towne
transcript. It is likely that the reading frames on either side of this signal, UL119 and
UL118, are spliced together to encode a membrane glycoprotein.



5.2 HCMV US3 IE Gene 

Sequencing of the US region of HCMV revealed an enhancer element containing
five 18-bp repeats with homology to the MIE 18-bp repeats and the core enhancer
element (Weston 1988). These repeats were located in the region -80 to -270 of an
RNA cap site in the HCMV-US3 (HQLF1) gene. In the region -340 to -600 a further
set of six novel 11-bp repeats was found. A 275-bp fragment containing the 18-bp
repeats enhanced expression in an orientation-independent manner in HeLa cells,
with an efficacy equivalent to the SV40 enhancer (Weston 1988), while the MIE
enhancer 18-bp repeats have recently been shown to be involved in positive
autoregulation by IE1 (Cherrington and Mocarski 1989). The significance of the
11-bp repeats is unknown. However, a hexanucleotide consensus (TRTCGC)
derived from these repeats was noted to occur in the MIE enhancer (Weston 1988).
Transcription from the HCMV-US3 reading frame associated with the enhancer is
highly active at IE times and produces a set of differentially spliced transcripts. The
protein-coding sequence of HCMV-US3 contains signal, anchor, and N-linked
glycosylation sequences, is homologous to HCMV-US2 (HQLF2), and may also be
related to the RL11 and US6 gene families (Sect. 8).
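A degenerate consensus such as TRTCGC (R = A or G) can be located in a sequence
by expanding the ambiguity codes into a regular expression. The Python sketch
below illustrates the idea; the helper names and the small subset of IUPAC codes
it handles are assumptions made for the example, and it searches one strand only.

```python
import re

# Subset of the IUPAC nucleotide codes, enough for this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Expand a degenerate consensus (e.g. TRTCGC) into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_consensus(seq, consensus="TRTCGC"):
    """Return 0-based start positions of every match on the given strand."""
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

To cover both strands, the same search would be repeated on the reverse
complement of the sequence.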






5.3 UL37 IE Gene

A second UL IE transcription unit was identified in the region of the AD169 HindIII
J and Z fragments (Wilkinson et al. 1984). The sequence of this region together with
mapping data for three mRNAs has been published (Kouzarides et al. 1988). A 3.4-kb
IE transcript was shown to be spliced from four exons and, like HCMV-US3,
encodes a potential glycoprotein. This mRNA is 3' coterminal with a 1.65-kb
transcript which can be detected in the IE phase but is more abundant at the late
stage of infection. The predicted product of the 1.65-kb mRNA is a member of the
US22 homologous protein family (Sect. 7.2). A 1.7-kb transcript utilizing the same
promoter as the 3.4-kb mRNA is most abundant at IE times but can also be detected
late in infection. Of the mapped transcripts only this RNA contains the HCMV-UL38
(HZLF3) reading frame. However, expression of UL38 from this transcript
would require the upstream UL37 exon 1 to be bypassed; alternatively, the frame
may be read from an uncharacterized low-abundance transcript (Kouzarides et al.
1988). A 40-kDa protein synthesized in vitro from HindIII Z or J hybrid-selected
mRNA is consistent with translation from UL38 (Wilkinson et al. 1984). Although
a slightly longer reading frame completely overlaps UL38 on the opposite strand
(UL38X, not shown), analysis of third position G + C contents suggests that of the
two opposing frames UL38 is more likely to be coding (84.3% vs 62.8% G + C).
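The third-position G + C comparison used to choose between two opposing frames
can be illustrated with a short Python sketch. The function names are
hypothetical, and the sketch assumes each frame is supplied as a DNA string that
starts at the first base of its first codon.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, for reading a frame on the other strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def third_position_gc(orf):
    """Percentage of G or C at the third position of each complete codon."""
    thirds = orf.upper()[2::3]  # bases at indices 2, 5, 8, ...
    if not thirds:
        return 0.0
    gc = sum(base in "GC" for base in thirds)
    return 100.0 * gc / len(thirds)
```

In a G + C rich genome, the frame with the higher third-position value would be
taken as the more likely coding frame, as was done for UL38 and UL38X.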



6 Early and Late Genes 



Immediate-early proteins are required to activate genes which establish the early or 
delayed early (E or DE) phase of infection, the outcome of which is the replication of 
the viral genome. Late genes are expressed at high levels after DNA replication and 
are likely to encode most of the structural and assembly proteins of the virus. The 
distinction between E and late phases is blurred for some genes, and is further 
complicated by posttranscriptional regulation of gene expression (DeMarchi 1983; 
Geballe et al. 1986a; Goins and Stinski 1986). In the following sections we attempt 
to correlate the available information on E and late genes with our sequence data. 
The organization of the following sections superficially resembles the viral timetable 
as convenient, but may be similarly inscrutable in places. 



6.1 Major Early Transcripts 

The most abundantly transcribed region of HCMV at early times postinfection is
situated in the long repeats of the virus and encodes a 2.7-kb transcript of unknown
function (Greenaway and Wilkinson 1987; Hutchinson et al. 1986; McDonough
et al. 1985). An early transcript of similar size also originates in RL of HCMV Towne
(Wathen and Stinski 1982), one copy of which can be deleted without compromising
viability in cultured human fibroblasts (Spaete and Mocarski 1987).






Greenaway and Wilkinson (1987) determined a 6220-bp sequence in HCMV
AD169 which encompasses the gene for the 2.7-kb transcript. Their sequence is
equivalent to positions 1635-7859 of Fig. 3 viewed in the opposite orientation. (We
refer only to TRL sequence positions for clarity.) It contains two ambiguities and
differs from our sequence at nine positions. However, only one of these is located
within the major early transcription unit; the doublet CC beginning at position 3386
of Greenaway and Wilkinson (1987) is a triplet in our sequence. The open reading
frame corresponding to the predicted translation product of the major 2.7-kb
transcript as mapped by these authors is TRL/IRL4. The translational start is
suggested to be the fourth ATG from the start of the transcript and occurs at
position 4294 in our sequence. This is not a Kozak ATG in that it does not have a
purine at -3 or a G at +4 (Kozak 1981, 1982). However, two upstream ATG
codons fit the Kozak consensus. The first has the sequence CGGATGG and is
followed by a stop codon after seven amino acids. The second has the sequence
GAGATGA and begins a 35-amino-acid reading frame. These codons have been
shown to inhibit translation from a downstream AUG and may therefore be cis-regulatory
signals (Geballe et al. 1986a; Geballe and Mocarski 1988). Upstream
Kozak consensus ATGs precede a number of other HCMV genes, and suggest a
general phenomenon in HCMV translational regulation. However, this role has yet
to be demonstrated directly and so far no products have been found for the major
early transcript. A less-abundant 2.0-kb transcript has been mapped immediately
downstream of the 2.7-kb transcript in the Eisenhardt strain of HCMV
(Hutchinson et al. 1986). The predicted polyadenylation site is conserved in
AD 169, beginning at position 6552 in our sequence. However, a similar-sized
transcript was not detected (McDonough et al. 1985). It is also not possible to
suggest a 5' end from the Eisenhardt strain restriction map data. There are, however,
no reading frames that might obviously be utilized in this region with the exception
of TRL/IRL6. A minor 1.3-kb immediate-early RNA and a 1.2-kb late RNA have
also been mapped to this general region (McDonough et al. 1985; Hutchinson
et al. 1986); the latter is detected at early times postinfection but is most abundant in
the late phase. The polyA signal for this message was located precisely in the
Eisenhardt strain and begins at position 6365 of our sequence (Hutchinson et al.
1986). These authors also mapped the start of the transcript by nuclease protection
and found no evidence for splicing. Further mapping and sequencing studies, the
latter performed on genomic as well as cDNA clones, were used to predict a coding
frame of 254 amino acids within the transcript (Hutchinson and Tocci 1986). The
region sequenced corresponds to positions 6300-7468 of Fig. 3 (displayed in the
IRL orientation). However, in AD 169 the 254-amino-acid reading frame is
disrupted by three stop codons and two frameshifts relative to the Eisenhardt
sequence and is identical in both repeats. Our data and those of Greenaway and
Wilkinson are in agreement for the region spanned by the putative reading frame.
We are unable to predict a reading frame which may be translated from this message
in AD 169. The first Kozak ATG occurs 164 nucleotides downstream of the
transcription start predicted by Hutchinson and Tocci (1986), but is followed by a
stop codon after 42 intervening amino acid codons. Furthermore, although
TRL/IRL7 is located in this message, it is over 500 bp from the predicted start. If
these differences between the Eisenhardt and AD 169 strains are genuine, sequencing
from other strains would be useful in assessing their biological relevance.
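The Kozak criterion applied in this section (a purine at -3 or a G at +4, counting
the A of the ATG as +1) can be written as a small Python check. This is a minimal
sketch: the function name is hypothetical, and the seven-nucleotide window
spanning positions -3 to +4 is an assumption chosen to match the two example
contexts quoted above.

```python
def fits_kozak(context):
    """Test a 7-nt window (positions -3 to +4) around a candidate ATG.

    The ATG must occupy indices 3-5 of the window. The site is accepted
    when position -3 is a purine (A or G) or position +4 is a G.
    """
    context = context.upper()
    if len(context) != 7 or context[3:6] != "ATG":
        raise ValueError("expected 7 nt with ATG at positions +1 to +3")
    purine_at_minus3 = context[0] in "AG"
    g_at_plus4 = context[6] == "G"
    return purine_at_minus3 or g_at_plus4
```

Both upstream contexts quoted in the text pass this test: CGGATGG through the G
at +4, and GAGATGA through the purine at -3.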



6.2 Enzymes of Nucleotide and DNA Metabolism 

6.2.1 Nucleotide Metabolism

Honess (1984) postulated that differences in overall base compositions between
herpesvirus genomes reflect the ability of the viruses to modulate and utilize the
nucleotide pool available for DNA synthesis. This hypothesis appears to be borne
out in the case of the two closely related α-herpesviruses, HSV-1 and VZV. The latter
is A + T rich and encodes a thymidylate synthase, which does not have a homolog in
the G + C rich HSV-1 genome (Thompson et al. 1987; McGeoch et al. 1988a). A
parallel exists in the less closely related γ-herpesviruses Epstein-Barr virus (EBV)
and herpesvirus saimiri (HVS); the latter A + T rich virus encodes thymidylate
synthase and dihydrofolate reductase, which both seem to be absent from the G + C
rich EBV (Honess et al. 1986; Trimble et al. 1988; Baer et al. 1984). All four viruses
also encode deoxyribonucleoside kinases, and hence can utilize the salvage
pathway of dNTP synthesis (McKnight 1980; Davison and Scott 1986; Littler
et al. 1986; Gompels et al. 1988a). These enzymes differ in their substrate specificity
and their main role might be to allow the exploitation of specific cell types, such as
may occur in latency. Genes for ribonucleotide reductase, a key enzyme in
deoxyribonucleotide synthesis, have been found in HSV, VZV, and EBV as well as
other herpesviruses, but have not so far been identified in HVS (Gibson et al. 1984;
Davison and Scott 1986; Nikas et al. 1986). The HCMV genome is relatively
G + C rich (Fig. 2) and it will be of interest to determine if its complement of enzymes is
consistent with the theory of Honess (1984). HCMV does not appear to encode a
thymidine (deoxyribonucleoside) kinase (TK); the position in the AD 169 genome
equivalent to the TK locus in other herpesviruses is deleted relative to the other
herpesviruses (Fig. 3). However, HCMV is sensitive to the nucleoside analog
DHPG, and a resistant mutant of AD 169 has been isolated which accumulates less
of the triphosphate form of the drug (Biron et al. 1986). This may indicate that a
deoxyribonucleoside kinase is encoded at some other locus.

The partial conservation of a ribonucleotide reductase (RR) homolog is more
puzzling. Mammalian cells contain an iron-tyrosyl radical enzyme, which is the type
found in herpesviruses (Sjoberg et al. 1985; Reichard 1989). The enzyme has an
α2β2 structure; the HCMV-UL45 gene product is homologous to the α (large) RR
subunit, and HCMV-UL45 is positionally conserved with the gene for this subunit
in other herpesviruses. However, the gene for the β (small) subunit does not appear
to be conserved; HCMV-UL44 is positionally analogous to the small RR gene in
other herpesviruses but encodes a set of late DNA-binding proteins (see Sect. 6.5).
The small subunit contains the active tyrosyl radical and would be essential for
function. Thus it is not clear at present if HCMV is capable of expressing a fully
active ribonucleotide reductase. Although we have used loosely defined motifs to
search all the predicted reading frames for a potential active site, no obvious
candidates were identified. Several explanations could account for this. For
example, if HCMV-UL45 is functionally conserved with the large subunit, it might
usurp the place of its cellular counterpart which mediates allosteric control as well
as being involved in catalysis. Herpesviral reductases appear to be unregulated,
indicating that the function is either unnecessary or perhaps detrimental in the viral
context (Lankinen et al. 1982; Averett et al. 1983). It is also possible that synthesis of
one or both of the cellular subunits is upregulated during viral infection (Stinski
1977). The genes for the human RR subunits are unlinked; the α-subunit gene is on
chromosome 11 (Engstrom et al. 1985), and the β-gene on chromosome 2 (Yang-Feng
et al. 1987). Finally, it is worth mentioning that another key allosteric enzyme
of nucleotide metabolism is dCMP deaminase; this enzyme converts dCMP to
dUMP, which is the substrate for thymidylate synthase. Hence it might be an
appropriate enzyme for herpesviral repertoires, particularly those which have
devolved to an A + T bias.

6.2.2 DNA Replication 

A set of seven HSV-1 genes has been shown to be essential for the replication of an
HSV-origin-containing plasmid (Wu et al. 1988; McGeoch et al. 1988b). The
HCMV homologs of four of these have been identified by sequence analysis.
HCMV-UL54 encodes the DNA polymerase (Kouzarides et al. 1987a; Heilbronn
et al. 1987) and HCMV-UL57 the major DNA-binding protein (MDBP). The latter
sequence shows 72% identity over a length of 1160 aligned amino acids to the
MDBP of simian CMV (Colburn) (Anders and Gibson 1988; Anders and Gibson,
personal communication). HCMV-UL105 encodes a homolog to HSV-UL5, which
is probably a helicase enzyme (Crute et al. 1988, 1989). Helicases belong to a
superfamily of proteins with functions in replication and/or recombination
(Hodgman 1988). A nucleotide-binding site in UL105 (Martignetti 1987), of the
type GxxGxGK (where x = any amino acid), is common to the other members of the
superfamily. HCMV-UL70 is the fourth HCMV gene with an obvious replication
gene counterpart, in HSV-UL52. The product of HSV-UL52 is part of a helicase-primase
complex in HSV-1-infected cells which also contains the HSV-UL5 and
UL8 proteins (Crute et al. 1989). HCMV genes UL102 and UL101 are positionally
equivalent to HSV-UL8 and UL9 respectively, although they show no clear-cut
homology. However, HCMV-UL102 is a similar length to HSV-UL8 (798 and 750
residues respectively). HSV-UL9 encodes an origin-binding protein (Olivo et al.
1988), and the positive identification of its HCMV counterpart may require the
identification of an HCMV origin of replication.
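A motif of the type GxxGxGK can be screened for in predicted protein sequences
with a simple pattern match. The Python sketch below is an illustration only: the
function name is hypothetical, and the sequence in the usage note is invented
for the example rather than taken from UL105.

```python
import re

# GxxGxGK, where x is any amino acid; the lookahead reports overlapping hits.
NUCLEOTIDE_BINDING = re.compile(r"(?=G..G.GK)")


def find_nucleotide_binding_sites(protein):
    """Return 0-based start positions of GxxGxGK matches in a protein string."""
    return [m.start() for m in NUCLEOTIDE_BINDING.finditer(protein.upper())]
```

Applied to the made-up sequence `"MAGSPGSGKST"`, the function reports a single
site starting at index 2. A match of this kind is only a candidate site and would
need supporting evidence.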

6.2.3 DNA Repair 

The gene for uracil-DNA glycosylase, which is involved in base excision repair, was
identified in HSV-2 and is conserved in the sequenced herpesviruses (Worrad and
Caradonna 1988; Baer et al. 1984; Davison and Scott 1986; Mullaney et al.
1989). The corresponding HCMV reading frame is HCMV-UL114, which is the last
frame at this end of UL with detectable homology to sequenced human herpesviruses.
A dUTPase gene is also conserved in herpesviruses, albeit less well than
uracil-DNA glycosylase (Preston and Fisher 1984; Davison and Scott 1986; Baer
et al. 1984). The HCMV homolog is HCMV-UL72.

6.2.4 Deoxyribonuclease

A deoxyribonuclease gene found in HCMV appears to be ubiquitous in herpesviruses,
as homologs are found in HHV-6 (Lawrence et al., unpublished results),
EBV (Zhang et al. 1987), HSV (McGeoch et al. 1986), and VZV (Davison and
Scott 1986). The role of this enzyme is currently unknown, but it may be involved in
cleavage of viral concatemers and/or the processing of genome termini (Chou and
Roizman 1989).



6.3 Phosphotransferase

The putative phosphotransferase encoded by HCMV-UL97 is conserved in the 
human herpesviruses and distantly related to the protein kinase family (Chee et al. 
1989a; Smith and Smith 1989). Interestingly, some of the most conserved amino 
acids in protein kinases are variant in the herpesvirus sequences. One motif where 
these differences occur is shared with bacterial phosphotransferases, which vary at 
the same amino acid positions as do the herpesvirus proteins (Brenner 1987). Hence
it remains to be shown if HCMV-UL97 and its homologs are in fact conventional
kinases. Whatever its specific role, the preservation of this gene in all of the
recognized herpesvirus lineages and HHV-6 implies an important or indispensable
contribution to the viral life cycle. None of the other HCMV reading frames we have
screened have detectable homology to known protein kinase motifs, which are seen 
in the a-herpesvirus US-encoded kinases (McGeoch and Davison 1986). 

6.4 Early Phosphoprotein Genes 

The gene for a set of phosphoproteins sharing a common N-terminus has been
mapped by Wright et al. (1988). These authors mapped the termini of two spliced
12-kb early transcripts, raised an antiserum against a synthetic peptide predicted
from a 5'-terminal portion of the 5'-exon sequence (Kouzarides et al. 1983;
Rasmussen et al. 1985a) and used this to detect four proteins of 34, 43, 50, and
84 kDa in infected cells (Wright et al. 1988). Pulse-chase experiments did not
suggest that any of the proteins were derivative in nature. Although the mapping
data are as yet incomplete, it would thus appear that all four proteins are coded in
alternatively spliced mRNAs sharing a 5' exon. This exon corresponds to UL112 in
our sequence. A 279-bp portion of the UL113 frame (positions 161 503-161 781) is
flanked by potential acceptor and donor sites, and may correspond to a 280-bp exon
mapped by Staprans and Spector (1986). The downstream exons may also be
derived from UL113, which extends to position 162797. A polyA signal begins at
position 162909, but there is an alternative polyA sequence coinciding with the end
of UL113 (ATTAAA, beginning at position 162796). It therefore seems likely that
one or both of these signals indicates the end of the transcription unit. The four
proteins were found to be predominantly contained in the nuclear fraction of
infected cells, and were not shown to be virion structural proteins in preliminary
studies (Wright et al. 1988).



6.5 Late DNA-Binding Proteins

Mocarski and coworkers utilized immunological screening of a λgt11 expression
library to map a group of proteins known as the ICP36 family to the HCMV-UL44
reading frame (Mocarski et al. 1985; Leach and Mocarski 1989). The ICP36
proteins gravitate to the nucleus, include phosphorylated and glycosylated species,
and are DNA-binding proteins (Pereira et al. 1982; Gibson 1983; Mocarski et al.
1985). Regulation of HCMV-UL44 gene expression is manifested in both early and
late transcription from different TATA boxes, and delayed translation of early
message (Leach and Mocarski 1989; Geballe et al. 1986b). The significance of this
complex control is unclear, although it is interesting that the 3'-end of the reading
frame is overlapped by a gene encoding a small RNA in the same orientation. This
gene is probably transcribed by RNA polymerase III (Marschalek et al. 1989).



6.6 Capsid Proteins 

The gene for the major capsid protein (MCP) was identified by sequence homology
to the MCP sequences of other human herpesviruses and the assignment confirmed
immunologically (Chee et al. 1989b). The MCP is encoded by the HCMV-UL86
reading frame. Homology searches show that the predicted protein sequence of
another frame, HCMV-UL47, is similar to a region of the human herpesvirus major
capsid proteins corresponding approximately to positions 1080-1170 of Fig. 3 in Chee
et al. (1989b). Although this match may be fortuitous, the alignment of HCMV-UL47
to conserved capsid sequences makes it of interest. However, the sequence is
not obviously conserved in the EBV, VZV, and HSV-1 reading frames collinear with
HCMV-UL47.

A second capsid protein, which is a constituent of incomplete capsids, has been 
mapped in the UL region of three CMV strains (Robson and Gibson 1989). Several 
lines of evidence implicate this protein in DNA packaging and/or capsid assembly 
(Preston et al. 1983; Irmiere and Gibson 1985; Lee et al. 1988; Rixon et al. 1988).
The gene for the putative assembly protein is conserved in the human herpesviruses, 
and is predicted to encode proteins of 635, 605, 605, and 708 amino acids in HSV, 
VZV, EBV, and HCMV respectively (McGeoch et al. 1988a; Davison and Scott 
1986; Baer et al. 1984) (Table 1). The sequence of a 1-kb cDNA derived from the 
Colburn strain of CMV shows homology only to the 3' half of HCMV-UL80,
consistent with the 37-kDa size of the Colburn strain assembly protein which is 
probably processed at the carboxy terminus (Robson and Gibson 1989). A larger 
transcript of 1.8-kb is also encoded at this locus. The 5' portion of the HCMV-UL80 






frame is conserved in the other sequenced human herpesviruses. It thus seems likely 
that at least two separate proteins are encoded by HCMV-UL80, with a TATA box
at position 115992 being used to produce the assembly protein transcript (Robson
and Gibson 1989). This TATA box is located within 15 bp which are identical in
Colburn and AD169 (Necker et al. 1988 cited in Robson and Gibson 1989). It is also
noteworthy that the ATG downstream of this TATA box does not fit the Kozak 
consensus in either of the two CMV sequences. In contrast to the major DNA- 
binding protein (Sect. 6.2.2), the sequences for the putative assembly protein are 
quite divergent. The Colburn sequence from the first methionine of the predicted 
cDNA reading frame exhibits approximately 40% identity to the carboxy-terminal 
371 amino acids of HCMV-UL80. 
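Identity figures such as the approximately 40% quoted above depend on how gapped columns are counted. As a hedged illustration, the sketch below computes percent identity over two pre-aligned sequences and excludes any column containing a gap. That convention, the function name `percent_identity`, and the toy sequences are assumptions; the text does not state which convention was used for the published figures.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (an assumed convention)
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Toy alignment (hypothetical): 4 identical residues out of 5 ungapped columns.
print(percent_identity("MKT-AL", "MRTQAL"))  # 80.0
```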



6.7 Structural Phosphoprotein Genes

HCMV virions contain three main phosphoproteins which appear to be located in
the virion tegument (Roby and Gibson 1986). The largest of these is approximately 
150 kDa in size, constitutes approximately 20% of virion protein content (Irmiere
and Gibson 1983), and is also modified by O-linked glycosylation (Benko et al.
1988). A 6360-bp region containing the pp150 gene sequence (which corresponds to
the reading frame HCMV-UL32) has been published and spans positions 37157-
43516 of Fig. 3 viewed in the opposite orientation. A late 6.2-kb mRNA was mapped
in this region, and its termini delineated. Some processing at an alternative polyA
site (ATTAAA) downstream of the orthodox signal was demonstrated. The major
RNA species is predicted to encode pp150 although a range of smaller RNA species
was also detected (Jahn et al. 1987). 

The two other major phosphoproteins located in virions are pp71 and pp65, also
known as the upper and lower matrix phosphoproteins respectively. The 65-kDa
phosphoprotein is also glycosylated (Clark et al. 1984; Pande et al. 1984), and pp71
may be similarly modified. The genes for pp65 and pp71 are located in the HindIII L,
c, b region of the genome and correspond to reading frames HCMV-UL83 and
UL82 respectively. The sequence of a HindIII/BglII fragment containing these genes
has been reported, and corresponds to nucleotides 117276-121377 of Fig. 3 viewed
in the opposite orientation (Ruger et al. 1987). The published sequence is in error;
position 212 (121166 in the genome) is shown as a G but should be read as a C. This
change does not affect the predicted coding sequences. Two transcripts which
appear to be 3' coterminal were mapped in this region. They are an abundant 4-kb
mRNA and a low-level 1.9-kb mRNA. The 5' ends of both transcripts have been
located, but surprisingly no TATA box is proximal to the 4-kb transcription unit
(Ruger et al. 1987). The 4-kb message should encode pp65, while the shorter mRNA
would allow pp71 to be translated. The mRNA encoding pp65 (ICP27) in HCMV
Towne appears to be produced efficiently both early and late in infection, but is not
translated at high levels until the late phase (Geballe et al. 1986b; but see Depto
and Stenberg 1989). The gene sequences for two further structural phosphoproteins
have been reported (Meyer et al. 1988; Davis and Huang 1985). The data of Meyer
et al. (1988) represent positions 143791-145191 of our sequence in the HindIII R




fragment and show the gene for a 28-kDa protein encoded by a late 1.3-kb RNA. 
Martinez et al. (1989) and Martinez and St. Jeor (1986) mapped a 25-kDa protein
to the same locus and assigned a 1.6-kb late mRNA as the message. These RNAs
are likely to be initiated from one or both of two TATA boxes proximal to HCMV- 
UL99. An HCMV Towne 1.4-kb late mRNA localized to this region may also 
denote HCMV-UL99 (Pande et al. 1988). However, the Towne protein migrates as 
a 32-kDa protein. If the same frame is in fact being used, nontrivial explanations for 
the difference could be invoked at the genetic, transcriptional, and protein- 
processing levels. It is interesting to note that a minor 27-kDa species was detected 
by Pande et al. (1988) in infected cells and virions. 
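Several published fragments in this section are described as corresponding to genome positions "viewed in the opposite orientation", for example position 212 of the HindIII/BglII fragment and genome position 121166. The arithmetic behind such a conversion can be sketched as follows. The function name `fragment_to_genome` and its parameters are illustrative assumptions; the coordinates are those quoted in the text.

```python
def fragment_to_genome(pos, genome_start, genome_end, reverse=True):
    """Convert a 1-based position within a fragment to a genome coordinate.

    With reverse=True the fragment is read on the opposite strand, so
    fragment position 1 corresponds to genome_end.
    """
    length = genome_end - genome_start + 1
    if not 1 <= pos <= length:
        raise ValueError("position lies outside the fragment")
    if reverse:
        return genome_end - (pos - 1)
    return genome_start + (pos - 1)


# Position 212 of the fragment spanning 117276-121377, read in reverse.
print(fragment_to_genome(212, 117276, 121377))  # 121166

# Length check for the 37157-43516 region described as 6360 bp.
print(43516 - 37157 + 1)  # 6360
```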

An example of a phosphoprotein gene that appears not to be conserved between 
HCMVs Towne and AD169 was mapped and sequenced from passage 36 of HCMV 
Towne (Davis et al. 1984; Davis and Huang 1985). This gene encodes an abundant 
late transcript, and immunological evidence suggests that its product is a 67-kDa 
nonglycosylated phosphoprotein found in virions. The sequenced fragment corresponds
very approximately to a region of AD169 HindIII D beginning at about
position 95500. There appear to be significant differences between the two genomes
in this region. These include numerous point and frameshift mutations and a
deletion of 61 bp in Towne relative to AD169. A consequence of some of these
differences is the disruption of the putative Towne reading frame in AD169,
although a portion of the predicted phosphoprotein sequence is preserved in
HCMV-UL65. The reported sequence was not determined fully on both strands,
and not all sequenced fragments were shown to be contiguous. Hence further
comparative sequence analysis and transcript mapping will be necessary before
these findings can be interpreted unambiguously, particularly as the equivalent
region in AD169 contains some potential splice sites. A gene which is posttranscriptionally
regulated by an mRNA 3'-end processing event was partially sequenced and
shown to contain a potential stem-loop structure (Goins and Stinski 1986). This
sequence maps to positions 96753-97076, and may therefore correspond to the 3'
end of a transcription unit spanning HCMV-UL65. The putative stem-loop
structure in the Towne sequence is conserved in AD169, although there are three
deletions relative to AD169 clustering in the 3'-terminal 25 nucleotides of the
published sequence. 

6.8 Surface Glycoproteins 

The importance of glycoproteins as surface antigens has made the major HCMV 
glycoproteins a focus for characterization and functional studies. A total of 54 
reading frames have now been found in the sequence that have characteristics of
glycoprotein genes or of exons of glycoprotein genes. These are presented in Table 3, 
which shows the predicted signal sequences, the number of N-linked glycosylation 
sites, and the anchor sequences. Twenty-two of these frames lack either a signal or an 
anchor. In the following sections we consider two immunologically important 
glycoproteins, and two which have homology to host immunoglobulin superfamily 
proteins. Known IE glycoprotein genes and glycoprotein gene families are 
considered separately in Sects. 5 and 7 respectively. 
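One of the features tabulated for candidate glycoprotein frames is the number of potential N-linked glycosylation sites (NXS/T motifs). A minimal sketch of such a count is given below. Excluding proline at the X position follows the usual sequon convention and is an assumption here, as are the function name `count_n_glyc_sites` and the toy peptide; the criteria actually applied for Table 3 are not stated in the text.

```python
import re

# N-x-S/T with x not proline (assumed convention). The lookahead lets
# overlapping sequons such as N-N-T-S be counted separately.
SEQUON = re.compile(r"(?=N[^P][ST])")


def count_n_glyc_sites(protein):
    """Count potential N-linked glycosylation sequons in a protein sequence."""
    return len(SEQUON.findall(protein.upper()))


# Toy peptide (hypothetical): NAS, NNT and NTS count; NPT does not.
print(count_n_glyc_sites("MNASNPTANNTS"))  # 3
```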




6.8.1 Glycoproteins B and H 

There are seven virion glycoproteins encoded by HSV-1 and one putative 
glycoprotein (US5) predicted from the sequence (McGeoch et al. 1988a). Of these,
five have counterparts in the sequence of VZV (Davison and Scott 1986) and only
two in the genome of EBV (Baer et al. 1984). In addition, EBV has the gp350/220
(BLLF1a,b), BILF1, and BLRF1 glycoproteins. The latter has a homolog in
HCMV-UL73. Of the other herpesvirus glycoproteins, only homologs to gB
(HCMV-UL55) (Cranage et al. 1986; Kouzarides et al. 1987b; Mach et al. 1986)
and gH (HCMV-UL75) (Cranage et al. 1988; Pachl et al. 1989) have been found in
the HCMV sequence, and so gB and gH are common to all of the well-studied
herpesviruses. The conservation of gH in distantly related herpesviruses (Gompels
et al. 1988b) and the production by an HSV-1 ts mutant of noninfectious virus
lacking gH (Desai et al. 1988) underpin the substantial body of immunological
evidence that gH is essential for virus infectivity. Monoclonal antibodies to HCMV
gH can neutralize virus in vitro unassisted by complement (Rasmussen et al. 1984;
Cranage et al. 1988). Antibodies to gB are also able to neutralize virus in vitro, but
require complement (Cranage et al. 1986). A virion envelope glycoprotein complex
has been shown to contain gB, but the structural nature of this entity awaits
definition (see, for example, Farrar and Greenaway 1986; Gretch et al. 1988a).
The unmodified gB precursor in AD 169 is predicted to be 102 kDa in size. This is 
processed and glycosylated to give a 145-kDa species which is proteolytically 
cleaved to produce a 55-kDa species, both of which can be detected in infected cells. 
However, the residual 90-kDa amino-terminal cleavage product is not detected 
(Cranage et al. 1986). The site of cleavage has been mapped to Arg460 in the gB of
HCMV Towne and by analogy processing of the AD 169 gB is likely to occur after 
Arg459 (Spaete et al. 1988). These authors also compare the gene and protein 
sequences of gB and find identities of 94% and 95% respectively between the two 
HCMV strains. (A similar level of conservation is found between the gH sequences of 
these strains; Pachl et al. 1989.) There appear to be noteworthy differences in the 
kinetics of gB transcription in these two strains. The AD169 gB transcripts are 
produced late in infection (Kouzarides et al. 1987b) while the Towne gB mRNA is 
of the early class. However, in HCMV Towne infected cells gB is not detected 
immunologically until late in infection (Rasmussen et al. 1985b), implying that the 
two strains might use different strategies to achieve a similar result in the regulation 
of gB expression. 

6.8.2 HLA Homolog 

The identification of an HCMV glycoprotein with homology to class I major 
histocompatibility (MHC) antigens has implications for host-virus interactions 
(HCMV-UL18, Beck and Barrell 1988). The crystal structure of a human class I 
histocompatibility molecule (HLA-A2) has been solved (Bjorkman et al. 1987a), 
making it possible to predict that the HLA homolog is likely to have three 
extracellular domains analogous to the class I α1-, α2-, and α3-domains. The latter
contains a β2-microglobulin (β2m)-binding loop which is partially conserved in the
HCMV sequence (Beck and Barrell 1988). In cellular HLA molecules, the α3-
domain and associated β2m are both [...] and α2-domains which each contain a long
α-helix. [...] between these helices forms [...] binding to a T-cell [...]. [...]
cellular sequences, both the α1- and α2-domains in the HCMV homolog are
potentially heavily glycosylated [...] ten NXS/T motifs. Three
or four of these motifs are [...]
and hence might have a direct [...]. [...] recombinants is in fact heavily
glycosylated ([...] personal communication). In light of
recent [...] CMV can prevent the association of specific viral
antigens with MHC (Del Val et al. 1989), a role for the HCMV HLA homolog in
infected cells can be proposed. That [...] compete with cellular
HLA for the binding of one or [...] consequently
interfere with their presentation [...] 1989). While it
is also possible [...] be due to the HLA
homolog, no evidence for a link between [...] has been presented (Stannard
[...]). [...] expressed with β2m from vaccinia vectors it is capable of associating with β2m
which can then be [...] ([...] Minson, personal
communication). [...]
to be unique to β-herpesviruses.

6.8.3 T-Cell Receptor Homology 

Even more provocative than the identification of a HLA homolog is the finding that
HCMV-UL20, which is in close proximity [...] gene, encodes a protein
with similarity to [...] ([...] unpublished
observations). However, the [...]
region with both the constant (Cγ) and variable (Vγ) TCRγ-regions is possible. The
former alignment shows approximately [...]
latter has approximately [...]
alignment matches [...] alignment
[...] of homology including a highly conserved
[...] contains [...] bond formed within Vγ may not be conserved;
[...] membrane domain. It is clear that no [...] homolog,
[...] et al. 1987).






7 Gene Families 



In addition to gB and gH, several small glycoprotein genes were identified in 
HCMV, in US (Weston and Barrell 1986). These are arranged tandemly and tend 
to cluster as homologous blocks of reading frames, constituting a large proportion 
of the gene families found in HCMV. Interestingly, the HSV US glycoprotein genes 
are also clustered (Davison and McGeoch 1986; McGeoch et al. 1988a). We 
currently recognize nine sets of homologous genes in the AD 169 genome. There are 
three pairs (UL25 and UL35; UL82 and UL83; and US2 and 3) and six larger 
groups. Of the latter, three occur in US where they account for a total of at least 21 
genes (Weston and Barrell 1986); one family occurs in UL and RL; and two 
families are partitioned between the long and the short regions of the genome 
(Table 1). The discovery of redundant protein coding sequences outside repeat
regions was unexpected and presents a contrast to those single genes encoding 
multiple products (for example, see Sects. 6.4 and 6.5). Their presence also appears to 
contradict the virally frugal gene layout of HCMV. As individual family members 
are likely to have subtle differences in function, this paradox may be difficult to 
resolve. The characteristics of four gene families are discussed below. Proteins have 
been recognized for three of these, while the fourth is homologous to a class of 
cellular receptors. The evolutionary implications of these findings are discussed in 
Sect. 8. 



7.1 The RL11 Family

This family comprises fourteen members distributed in the long repeats and a
portion of UL adjacent to TRL (Table 1; Fig. 1). The sequences are characterized by
a motif which resembles the cellular Thy-1 in a region which is conserved with some
other members of the immunoglobulin superfamily (C. A. Hutchison III, unpublished
observations). The members of the RL11 family are predicted to be
membrane glycoproteins (Table 3). This prediction has been substantiated by the
immunological detection of the Towne UL4-equivalent protein in infected cells and
virions (Chang et al. 1989a). The detected 48-kd protein is expressed during the
early phase of infection, and its presence in virions has led to its classification as an
early structural glycoprotein (Chang et al. 1989a). Its published amino acid
sequence is 84% identical to UL4 over 150 amino acids. Multiple alignment of the
RL11 family suggests that UL4 (which does not contain an anchor sequence) may be
spliced to UL5 (which has an anchor but no signal or N-glycosylation sites), as their
respective RL11 homologous regions appear to dovetail somewhat. However,
splicing was not observed in transcript mapping experiments (Chang et al. 1989b).
Nevertheless, Chang et al. (1989a) detect a reduction in the size of the protein from
48 kd to 27 kd when infected cells are treated with an inhibitor of N-linked
glycosylation, although the theoretical size of UL4 alone is approximately 17 kd.
While this difference could be attributable to other post-translational modifications,
it is noteworthy that the theoretical size of RL11, which is homologous to both UL4
and UL5, is approximately 27 kd. The mapped transcripts, which are initiated from






three different promoters, also contain the UL5 reading frame. Hence it may be of
interest to further characterize the 27-kd protein. UL8 is truncated similarly to
UL5 and therefore is also a candidate for splicing. As both these frames also contain
Kozak consensus ATG codons, a potential exists for the expression of this gene
family to be regulated in a complex manner.
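Whether an ATG "fits the Kozak consensus" is usually judged from two flanking positions: a purine at -3 and a G at +4, counting the A of the ATG as +1. The sketch below applies that rule. It is offered as an assumption-laden illustration only: the three-level labels, the function name `kozak_strength`, and the test sequences are not taken from the original analysis, whose exact criteria are not given in the text.

```python
def kozak_strength(seq, atg_index):
    """Classify the context of the ATG starting at 0-based index `atg_index`.

    "strong"   : purine at -3 and G at +4
    "adequate" : only one of the two positions matches
    "weak"     : neither position matches
    """
    seq = seq.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < 3 or atg_index + 3 >= len(seq):
        raise ValueError("not enough flanking sequence")
    purine_at_minus3 = seq[atg_index - 3] in "AG"
    g_at_plus4 = seq[atg_index + 3] == "G"
    if purine_at_minus3 and g_at_plus4:
        return "strong"
    if purine_at_minus3 or g_at_plus4:
        return "adequate"
    return "weak"


# Toy contexts (hypothetical sequences).
print(kozak_strength("GCCACCATGGCG", 6))  # strong
print(kozak_strength("TTTTTTATGTTT", 6))  # weak
```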

7.2 The US6 Family

This family corresponds to family 2 described by Weston and Barrell (1986) and is
characterized by two areas of sequence homology, the second of which (region 2;
Weston and Barrell 1986) is less well conserved. The region 1 core motif can be
defined as C(VY)x(DQKR) (7-10) WxxxGxF where the bracketed residues are
alternatives and x is any residue. The region 2 motif is characterized by cysteine and
proline residues: PCxxC (4-6) CxPxxxxPWxP. The six members of this family are
predicted to be membrane glycoproteins (Tables 1 and 3). Gretch et al. (1988b)
have recently used a MAb to demonstrate that this family correlates with the gp47-
52 virion envelope glycoprotein complex they described previously (Gretch et al.
1988a). Northern hybridization revealed three early transcripts from this region, two
of which were minor species. The 1.6-kb size of the major transcript was consistent
with initiation from the HCMV-US11 (HXLF1) TATA box, and in vitro translation
experiments suggested it was bicistronic in nature. Gretch et al. (1988a) suggest on
the basis of these data and amino acid composition analysis that the main
constituents of gp47-52 might be HCMV-US10 and US11 proteins. However, no
direct correlation was established between the abundance of the putative transcript
and the composition of gp47-52.
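The two US6 family motifs quoted above translate directly into regular expressions. In this sketch the ranges "(7-10)" and "(4-6)" are read as spacers of 7 to 10 and 4 to 6 arbitrary residues, which is an interpretation of the notation rather than something the text states explicitly. The names `REGION1`, `REGION2`, and `find_motif`, and the synthetic test peptides, are hypothetical.

```python
import re

# Region 1: C(VY)x(DQKR) (7-10) WxxxGxF
REGION1 = re.compile(r"C[VY].[DQKR].{7,10}W...G.F")
# Region 2: PCxxC (4-6) CxPxxxxPWxP
REGION2 = re.compile(r"PC..C.{4,6}C.P....PW.P")


def find_motif(pattern, protein):
    """Return (1-based start, matched text) for the first match, or None."""
    match = pattern.search(protein.upper())
    if match is None:
        return None
    return (match.start() + 1, match.group())


# Synthetic peptides built to satisfy each motif (not real viral sequences).
print(find_motif(REGION1, "MMCVAKAAAAAAAAWAAAGAF"))
print(find_motif(REGION2, "PCAACAAAAACAPAAAAPWAP"))
```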

7.3 The US22 Family

This family is distributed in UL, US and RS and sequences for eight of the thirteen
recognized members have been published, including the family 4 members described
by Weston and Barrell (1986). Genes attributed to this family contain one or more
of three sequence motifs (Kouzarides et al. 1988). The first motif (ooCCxxxLxxoG
where o is any hydrophobic residue and x any residue) is found in all of the members
except IRS1/TRS1 and UL28. Interestingly, in HCMV-UL36 the junction of exons 1
and 2 occurs immediately before the motif (Kouzarides et al. 1988). As HCMV-
UL42 ends within the motif (FLCCDKFLPG-COOH), it is possible that this
gene and perhaps other members of the family apart from HCMV-UL36, encode
spliced transcripts. The remainder of the pattern comprises two motifs which are
largely hydrophobic and may overlap in function. The IRS1/TRS1 genes, identical
over most of their length, diverge shortly after the third motif. Apart from the
conserved motifs, several of these sequences contain short runs of charged residues
in their carboxy-terminal domains, and 6 of the 12 members of the US22 gene family
have at least 1 N-linked glycosylation site. However, there does not appear to be any
obvious correlation between these latter features. The only present correlation






between this gene family and viral proteins comes from the identification of the 
HCMV-US22 gene product ICP22. This is an early protein localizing in the nucleus
which is also detectable in the cytoplasm and may be secreted from infected cells
(Mocarski et al. 1988). Interestingly, the MAb used to identify US22 does not
appear to recognize any of its homologs. 
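The observation that HCMV-UL42 "ends within the motif" can be checked position by position against the first US22 family motif, ooCCxxxLxxoG. The sketch below does so under one stated assumption: the set of residues treated as hydrophobic (`HYDROPHOBIC`) is chosen for illustration and may differ from the definition used by Kouzarides et al. (1988). The function names and the second test peptide are likewise hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWY")  # assumed definition of "o"
MOTIF = "ooCCxxxLxxoG"         # first US22 family motif as quoted in the text


def residue_fits(symbol, residue):
    """Test one residue against one motif symbol (o, x, or a literal residue)."""
    if symbol == "x":
        return True
    if symbol == "o":
        return residue in HYDROPHOBIC
    return residue == symbol


def matches_prefix(peptide, motif=MOTIF):
    """True if `peptide` fits the first len(peptide) positions of the motif."""
    if len(peptide) > len(motif):
        return False
    return all(residue_fits(s, r) for s, r in zip(motif, peptide.upper()))


def find_full(protein, motif=MOTIF):
    """Return the 1-based start of the first complete motif match, or None."""
    protein = protein.upper()
    for i in range(len(protein) - len(motif) + 1):
        window = protein[i:i + len(motif)]
        if all(residue_fits(s, r) for s, r in zip(motif, window)):
            return i + 1
    return None


# The C-terminal residues quoted for HCMV-UL42 fit the first ten motif
# positions but cannot contain the complete twelve-residue motif.
print(matches_prefix("FLCCDKFLPG"), find_full("FLCCDKFLPG"))
```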



7.4 The G-Protein Coupled Receptor (GCR) Family 

Several HCMV reading frames, mostly located in US, are predicted to be integral
membrane proteins capable of spanning the membrane several times (Table 1). All of
these have seven potential membrane-spanning regions. Three of the reading frames,
HCMV-US27 and HCMV-US28 (originally named HHRF2 and HHRF3; Weston
and Barrell 1986), and HCMV-UL33, show homology to the opsin family of cell
surface receptors (Chee et al., submitted). Members of this diverse family of receptors



[Alignment rows for HCMV-UL33, HCMV-US27, HCMV-US28, rhodopsin, B-2-ADR, and MAR; the sequence data are not recoverable from the scan.]

Fig. 4. An alignment of the three HCMV G-protein-coupled receptor homologs with bovine rhodopsin
(Nathans and Hogness 1983), human β-2-adrenergic receptor (B-2-ADR) (Kobilka et al. 1987), and
porcine muscarinic acetylcholine receptor (MAR) (Kubo et al. 1986). The NXT/S motifs are underlined in
the N-terminal extracellular domain and identities which correspond in at least five of the six sequences
are boxed. The seven membrane-spanning helical domains are indicated by numbered bars beneath the
alignment. Each transmembrane domain and its disposition is defined by a motif unique within the
sequence. The alignment has been truncated within the cytoplasmic C-terminal domains which possess
receptor-specific functions, and sections of 30 and 134 amino acids have been excised from the B-2-ADR
and MAR sequences respectively beginning at position 248. The two conserved cysteine residues at
alignment positions 117 and 203 have been shown to be essential for function in bovine rhodopsin
(Karnik et al. 1988)
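Potential membrane-spanning regions of the kind counted for these reading frames are commonly screened with a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a threshold of 1.6, which are the values conventionally associated with that scale. This is an assumption made for illustration: the text does not state which method was used to predict the seven membrane-spanning regions, and the function names and toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of length `window` along the protein."""
    scores = [KD[residue] for residue in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows whose mean hydropathy meets the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, score in enumerate(profile) if score >= threshold]


# Toy sequence (hypothetical): a 19-residue leucine stretch flanked by aspartate.
toy = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_windows(toy))
```

Runs of consecutive window starts would then be merged into single candidate helices; that step is omitted here for brevity.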






[...] memory and learning. [...] of this [...]

HCMV-US27 [...] amino acids [...]

8 Relationship to α- and γ-Herpesvirus Genomes

[...] sequenced [...] the evolu[...] arrangements of genes [...]. Figure 5 shows the
relationships of [...] VZV, EBV, and HSV-1 [...] resulting [...] recombinatorial
junctions [...] formation of genes [...] are grossly [...] between all three of the
[...] families of herpesviruses have diverged [...] than this core set [...].
However, at the protein sequence level [...] more closely related to EBV than [...]
block show widely varying levels of [...] following distinctive features of [...]
evolutionary past:






[...] genes [...] HCMV [...] appear to lie between [...]. [...] in contrast the
majority of the glycoprotein genes lie within US and in UL at [...] of the prototype
genome. Members of two families (the RL11 and US22 families) [...]. Two families
(the US22 and GCR [...]) [...] the long regions of the genome. It [...] that the RL11,
US2, and US6 families, together with HCMV-U[...] long regions. These [...]
family" which is also [...] glycoprotein exons) which are
mostly in the range [...]. [...] sequences all encode glycoproteins [...] sequence
alignment reveals short regions of [...] US2 and US3 and some
members of the RL11 family. For example, [...] RL11 family anchor sequences are
characterized by [...] (Table 3). The
distinguishing motifs of the RL11 and [...] families also show some similarity, and
may also be echoed in HCMV-US34.

[...]       Cxx(QEKR) (7-10) W [...] GxF
[...]       Cxx(NQEKTV) (4-6) (VFLI)Nx(ST)xxxx [...] GxV
HCMV-US34   CLAE [...]

Finally, the majority of the genes [...]
copies. These observations suggest that the HCMV [...]
expanding by gene duplication and divergence [...] which
[...] by the HCMV DNA replication [...]
may be related to [...] ([...]
Clements 1984; Davison and McGeoch [...]) [...]
been at least one [...] of
HCMV which led to the distribution of [...] families between both regions. A
possible scenario for such an event [...]
segment leading to the conversion of [...] non-inverting genome to a four-
isomer genome. Genes [...] of the two new subsegments
might then diverge together with the [...] contraction of the repeats. The
characterization of [...] help clarify the
evolutionary history [...]
HCMV for gene duplication is a general characteristic of the β-herpesviruses.


9 Perspectives 






which unify this highly divergent group of viruses are now coming into focus at the 
genetic level. The sequences have facilitated the correlation of biological and genetic
experiments, and allowed much of this work to be generalized. The growing body of 
relational knowledge should make it increasingly informative to begin the 
characterization of herpesvirus genomes by sequencing. These data will continue to 
provide predictions which can be tested, and which promise to shed further light on 
the herpesviruses and their eukaryotic environment. 

Acknowledgments. We thank Jon Oram and Peter Greenaway for providing the 
HindIII clones used in the sequence analysis and Bernard Fleckenstein for providing
the cosmid clones used for determining the overlaps of the HindIII sites. We are
grateful to Tony Minson and Helena Browne for advice and for making available 
results prior to publication, and to Mark Stinski for comments on parts of the 
manuscript. M.C. thanks the Commonwealth Scholarships Commission for 
support. 



References 



Addison Q Rixon FJ, Palfreyman JW, O'Hara M, Preston VG (1984) Characterisation of a herpes 

simplex virus type 1 mutant which has a temperature-sensitive defect in penetration of cells and 

assembly of capsids. Virology 138: 246-259 
Akrigg A, Wilkinson GWG, Oram JD (1985) The structure of the major immediate early gene of human 

cytomegalovirus strain AD 169. Virus Res 2: 107-121 
Anders DG, Gibson W (1988) Location, transcript analysis, and partial nucleotide sequence of the 

cytomegalovirus gene encoding an early DNA-binding protein with similarities to ICP8 of herpes 

simplex virus type 1, J Virol 62: 1 364- 1 372 
Avertt DR, Lubbers C, Elion GB, Spector T (1983) Ribonucleotide reductase induced by herpes simplex 

virus type I, Characterisation of a distinct enzyme. J Biol Chem 258: 9831-9838 
Baer R, Bankier AT, Biggin MD, Deininger PL, Farrell PJ. Gibson TJ, Haifull G, el al. (1984) DNA 

sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 310: 207-211 
Bairoch A (1988) Swiss-Prot protein sequence data bank release 8.0. Department de Biochimie Medicale, 

Centre Medical Universitaire, Geneva 
Bankier AT, Barrell BG (1989) Sequencing single strand DNA using the chain termination method. In:

Ward S, Howe C (eds) Nucleic acids sequencing: a practical approach. IRL, Oxford (in press) 
Bankier AT, Weston KM, Barrell BG (1988) Random cloning and sequencing by the

M13/dideoxynucleotide chain termination method. Methods Enzymol 155: 51-93
Batterson W, Furlong D, Roizman B (1983) Molecular genetics of herpes simplex virus. VIII. Further

characterization of a temperature-sensitive mutant defective in release of viral DNA and in other 

stages of the viral reproductive cycle. J Virol 45: 397-407 
Beck S, Barrell BG (1988) Human cytomegalovirus encodes a glycoprotein homologous to MHC class-I

antigens. Nature 331: 269-272 
Benko DM, Haltiwanger RS, Hart GW, Gibson W (1988) Virion basic phosphoprotein from human 

cytomegalovirus contains O-linked N-acetylglucosamine. Proc Natl Acad Sci USA 85: 2573-2577
Biron KK, Fyfe JA, Stanat SC, Leslie LK, Sorrell JB, Lambe CU, Coen DM (1986) A human 

cytomegalovirus mutant resistant to the nucleoside analog

9-{[2-hydroxy-1-(hydroxymethyl)ethoxy]methyl}guanine (BW B759U) induces reduced levels of BW B759U

triphosphate. Proc Natl Acad Sci USA 83: 8769-8773 
Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987a) Structure of the 

human class I histocompatibility antigen, HLA-A2. Nature 329: 506-512
Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987b) The foreign 

antigen binding site and T-cell recognition regions of class I histocompatibility antigens. Nature 

329: 512-518 






cell receptor 7/CD3 <=«'"P;f« f"'"^ " S^^^^^^^^^ Schaffner W (1985) A very strong 

Bre2ferS(198pPhosphot«^er^^^^^^^ 

t'^u'^SJ^^^^^^^^^ on ^Hphera. blood cytotoxic T lymphocytes. Nature 

a«'n\'cT;S.eDH.NelsonJ.O.<.stoneMBA,SU„s™ 

hian •cy.omegalovi^s e^lyfy^PJ^^^- "toSovirus early gene has three inducible 

Chang C-P. Malone CU St.nsh MF (l^ A ^-^^ .^^^^^^ j yj^^, gj.. 281-290 

%?.e\r;2^«P-h^^^^^^^^ .„_,,3tes the a promoter- 

Cherrington JM, Mocarski ES (1989) Human cywrneg ,435_i44o 

eThScer via an >^ba«-pair «p^ sequence-specific proteins 

^"binSSX'Sifo^^^^^^^^^ 
aa;S£aK2'DircctoUTingY.P(1984) isolation^ 

%Kton%^opro.ein Of hum^^^^^^^^ ^^^^.Jf^, .,pe , 

Costa RH, Draper KG. KeUy ^J. Wagner EK^^^^^ P ^.^^^ jj^ ^^S 

transcript with sequent homo ogy to f If ^^J^ kW. Tomlinson P, Barrell EG, et al. 

Cranage MP. Kouzarides T. B^nkier AT. Sawh^^^^ 

(l986)IdentirHationofthehumancytome^b^^^^ 8 3057-3063 

antibodies via its expression m '«*0'"'';"""'<r'*S^ P.et al.(1988)Idenlincaiion 

CranageMP,SmithGUBeUSE.Hart H Brow^ ,he Epstein-Barr virus 

Crute JJ, Mocarski ES, Lehman IR (1988) A DNA helicase induced by herpes simplex virus type 1.
Nucleic Acids Res 16: 6585-6596 challberg MD, Mocarski ES, Uhman IR (1989) 

D.S;AJ.T.,K,,P(.9.7,0™.,= ..l.ao«.l«w«....«.~.i™-Ep«.in-B,„.l,«0 

Gen Virol 68: 1067-1079 Kn^zinowski UH (1989) Presentation of CMV immediate-early 

OeStS^/S^) Human cytomes.ovin.0^^^^^ ^^..^2^-38"^^^^ "'^^ '"^ 

oerS(U?^'?o':;rni^^^^^^^^ 

Oei?A^'?;:?^.RM(.989)ReguU^.p^^^^^^^ 
•sequence in the promoter -i'^"^^"^ "^^"f^^^^ 



virion infectivity. J Gen Virol 69: 1147-1156




Dohlman HG, Caron MG, Lefkowitz RJ (1987) A family of receptors coupled to guanine nucleotide 

regulatory proteins. Biochemistry 26: 2657-2664 
Dorsch-Hasler K, Keil GM, Weber F, Jasin M, Schaffner W, Koszinowski UH (1985) A long and

complex enhancer activates transcription of the gene coding for the highly abundant immediate early 

mRNA in murine cytomegalovirus. Proc Natl Acad Sci USA 82: 8325-8329 
Engstrom Y, Francke U (1985) Assignment of the structural gene for subunit M1 of human

ribonucleotide reductase to the short arm of chromosome 11. Exp Cell Res 158: 477-483
Farrar GH, Greenaway PJ (1986) Characterization of glycoprotein complexes present in human 

cytomegalovirus envelopes. J Gen Virol 67: 1469-1473 
Ferguson MAJ, Williams AF (1988) Cell-surface anchoring of proteins via glycosyl-phosphatidylinositol

structures. Annu Rev Biochem 57: 285-320 
Fickenscher H, Stamminger T, Ruger R, Fleckenstein B (1989) The role of a repetitive palindromic 

sequence element in the human cytomegalovirus immediate early enhancer. J Gen Virol 70: 107-123 
Fleckenstein B, Muller I, Collins J (1982) Cloning of the complete human cytomegalovirus genome in 

cosmids. Gene 18: 39-46 
Fulton R, Forrest D, McFarlane R, Onions D, Neil JC (1987) Retroviral transduction of T-cell antigen 

receptor β-chain and myc genes. Nature 326: 190-194
Geballe AP, Mocarski ES (1988) Translational control of cytomegalovirus gene expression is mediated 

by upstream AUG codons. J Virol 62: 3334-3340 
Geballe AP, Spaete RR, Mocarski ES (1986a) A cis-acting element within the 5' leader of a

cytomegalovirus β transcript determines kinetic class. Cell 46: 865-872
Geballe AP, Leach FS, Mocarski ES (1986b) Regulation of cytomegalovirus late gene expression: γ

genes are controlled by posttranscriptional events. J Virol 57: 864-874 
George DG, Barker WC, Hunt LT (1986) The protein identification resource (PIR). Nucleic Acids Res 

14: 11-15 

Ghazal P, Lubon H, Fleckenstein B, Hennighausen L (1987) Binding of transcription factors and creation
of a large nucleoprotein complex on the human cytomegalovirus enhancer. Proc Natl Acad Sci USA
84: 3658-3662

Ghazal P, Lubon H, Hennighausen L (1988) Specific interactions between transcription factors and the
promoter-regulatory region of the human cytomegalovirus major immediate-early gene. J Virol
62: 1076-1079 

Gibson W (1983) Protein counterparts of human and simian cytomegalovirus. Virology 128: 391-406 
Gibson T, Stockwell P, Ginsburg M, Barrell BG (1984) Homology between two EBV early genes and

HSV ribonucleotide reductase and 38K genes. Nucleic Acids Res 12: 5087-5099 
Goins WF, Stinski MF (1986) Expression of a human cytomegalovirus late gene is posttranscriptionally

regulated by a 3'-end-processing event occurring exclusively late after infection. Mol Cell Biol 

6:4202-4213 

Gompels UA, Craxton MA, Honess RW (1988a) Conservation of gene organization in the lymphotropic

herpesviruses herpesvirus saimiri and Epstein-Barr virus. J Virol 62: 757-767
Gompels UA, Craxton MA, Honess RW (1988b) Conservation of glycoprotein H (gH) in herpesviruses: 

nucleotide sequence of the gH gene from herpesvirus saimiri. J Gen Virol 69: 2819-2829 
Greenaway PJ, Wilkinson GWG (1987) Nucleotide sequence of the most abundantly transcribed early

gene of human cytomegalovirus strain AD169. Virus Res 7: 17-31
Gretch DR, Kari B, Rasmussen L, Gehrz RC, Stinski MF (1988a) Identification and characterization of

three distinct families of glycoprotein complexes in the envelopes of human cytomegalovirus. J Virol 

62: 875-881 

Gretch DR, Kari B, Gehrz RC, Stinski MF (1988b) A multigene family encodes the human 

cytomegalovirus glycoprotein complex gcII (gp47-52 complex). J Virol 62: 1956-1962
Grundy JE, McKeating JA, Griffiths PD (1987a) Cytomegalovirus strain AD169 binds β2 microglobulin

in vitro after release from cells. J Gen Virol 68: 777-784
Grundy JE, McKeating JA, Ward PJ, Sanderson AR, Griffiths PD (1987b) β2 microglobulin enhances

the infectivity of cytomegalovirus and when bound to the virus enables class I HLA molecules to be

used as a virus receptor. J Gen Virol 68: 793-803 
Heilbronn R, Jahn G, Burkle A, Freese U-K, Fleckenstein B, zur Hausen H (1987) Genomic localization,

sequence analysis, and transcription of the putative human cytomegalovirus DNA polymerase gene. J 

Virol 61: 119-124 

Hennighausen L, Fleckenstein B (1986) Nuclear factor 1 interacts with five DNA elements in the 
promoter region of the human cytomegalovirus major immediate early gene. EMBO J 5: 1367-1371

Hermiston TW, Malone CL, Witte PR, Stinski MF (1987) Identification and characterization of the 
human cytomegalovirus immediate-early region 2 gene that stimulates gene expression from an
inducible promoter. J Virol 61: 3214-3221

Hodgman TC (1988) A new superfamily of replicative proteins. Nature 333: 22-23 




mplex-- diverse observations and a unifying 

the formation «f ' 'Swski UH (1987) Sequence and structural organiza 
KouzandesT, ^^^^'^^^ p. , wgrf i * 47-58 „ n.r^n RG (1987a) Sequence and 

Kozak M (1982) ^nalys^of nbosome bmO g ^^^^^ 
sequenang and expression oi ^ p . „ M^d Virol 35: 152-185 






Lee JY, Irmiere A, Gibson W (1988) Primate cytomegalovirus assembly: evidence that DNA packaging 

occurs subsequent to B capsid assembly. Virology 167: 87-96 
Littler E, Zeuthen J, McBride AA, Trost-Sorensen E, Powell KL, Walsh-Arrand JE, Arrand JR (1986)

Identification of an Epstein-Barr virus-coded thymidine kinase. EMBO J 5: 1959-1966
Mach M, Utz U, Fleckenstein B (1986) Mapping of the major glycoprotein gene of human

cytomegalovirus. J Gen Virol 67: 1461-1467 
Marschalek R, Amon-Bohm E, Stoerker J, Klages S, Fleckenstein B, Dingermann T (1989) CMER, an

RNA encoded by human cytomegalovirus is most likely transcribed by RNA polymerase III. Nucleic

Acids Res 17: 631-643
Martignetti JA (1987) Sequence analysis of HCMV. Dissertation, Cambridge University
Martinez J, St Jeor SC (1986) Molecular cloning and analysis of three cDNA clones homologous to 

human cytomegalovirus RNAs present during late infection. J Virol 60: 531-538 
Martinez J, Lahijani RS, St Jeor SC (1989) Analysis of a region of the human cytomegalovirus (AD169)

genome coding for a 25-kilodalton virion protein. J Virol 63: 233-241
McDonough SH, Spector DH (1983) Transcription in human fibroblasts permissively infected by human

cytomegalovirus strain AD169. Virology 125: 31-46
McDonough SH, Staprans SI, Spector DH (1985) Analysis of the major transcripts encoded by the long

repeat of human cytomegalovirus strain AD169. J Virol 53: 711-718 
McGeoch DJ (1985) On the predictive recognition of signal peptide sequences. Virus Res 3: 271-286 
McGeoch DJ (1987) The genome of herpes simplex virus: structure, replication and evolution. J Cell Sci 

[Suppl] 7: 67-94 

McGeoch DJ, Davison AJ (1986) Alphaherpesviruses possess a gene homologous to the protein kinase
gene family of eukaryotes and retroviruses. Nucleic Acids Res 14: 1765-1777 

McGeoch DJ, Dalrymple MA, Davison AJ, Dolan A, Frame MC, McNab D, Perry LJ, et al. (1988a) The
complete sequence of the long unique region in the genome of herpes simplex virus type 1. J Gen Virol 
69: 1531-1574 

McGeoch DJ, Dolan A, Frame MC (1986) DNA sequence of the region in the genome of herpes simplex 
virus type 1 containing the exonuclease gene and neighbouring genes. Nucleic Acids Res 14: 3435- 
3448 

McGeoch DJ, Dalrymple MA, Dolan A, McNab D, Perry LJ, Taylor P, Challberg MD (1988b) Structures

of herpes simplex virus type 1 genes required for replication of virus DNA. J Virol 62: 444-453 
McKnight SL (1980) The nucleotide sequence and transcript map of the herpes simplex virus thymidine 

kinase gene. Nucleic Acids Res 8: 5949-5963 
Meyer H, Bankier AT, Landini MP, Brown CM, Barrell BG, Ruger B, Mach M (1988) Identification and 

procaryotic expression of the gene coding for the highly immunogenic 28-kilodalton structural 

phosphoprotein (pp28) of human cytomegalovirus. J Virol 62: 2243-2250 
Mocarski ES, Roizman B (1982) Structure and role of the herpes simplex virus DNA termini in inversion, 

circularization and generation of virion DNA. Cell 31: 89-97
Mocarski ES, Pereira L, Michael N (1985) Precise localization of genes on large animal virus genomes: 

use of λgt11 and monoclonal antibodies to map the gene for a cytomegalovirus protein family. Proc

Natl Acad Sci USA 82: 1266-1270 
Mocarski ES, Pereira L, McCormick AL (1988) Human cytomegalovirus ICP22, the product of the 

HWLF1 reading frame, is an early nuclear protein that is released from cells. J Gen Virol 69: 2613-

2621 

Mullaney J, Moss HWMcL, McGeoch DJ (1989) Gene UL2 of herpes simplex virus type 1 encodes a

uracil-DNA glycosylase. J Gen Virol 70: 449-454 
Nathans J (1987) Molecular biology of visual pigments. Annu Rev Neurosci 10: 163-194 
Nathans J, Hogness DS (1983) Isolation, sequence analysis, and intron-exon arrangement of the gene 

coding bovine rhodopsin. Cell 34: 807-814 
Nikas I, McLauchlan J, Davison AJ, Taylor WR, Clements JB (1986) Structural features of

ribonucleotide reductase. Proteins 1: 376-384
Olivo PD, Nelson NJ, Challberg MD (1988) Herpes simplex virus DNA replication: the UL9 gene 

encodes an origin-binding protein. Proc Natl Acad Sci USA 85: 5414-5418 
Oram JD, Downing RG, Akrigg A, Duggleby CJ, Wilkinson GWG, Greenaway PJ (1982) Use of

recombinant plasmids to investigate the structure of the human cytomegalovirus genome. J Gen Virol 

59: 111-129 

Pachl C, Probert WS, Hermsen KM, Masiarz FR, Rasmussen L, Merigan TC, Spaete RR (1989) The
human cytomegalovirus strain Towne glycoprotein H gene encodes glycoprotein p86. Virology
169: 418-426

Pande H, Baak SW, Riggs AD, Clark BR, Shively JE, Zaia JA (1984) Cloning and physical mapping of a 
gene fragment coding for a 64-kilodalton major late antigen of human cytomegalovirus. Proc Natl 
Acad Sci USA 81: 4965-4969 




l^** * r .-^nnftheecne encoding 

^ - I A n am Genomic localization of me gene 

:srS?ilSss=^«— ^^^^ 

sequence of the gene for pseudorao MO««.rans-Aclivation and auioregula lion 

capsids.JGenV.ol69-2»^^^^^^^^^^ 
dSrfSUm*""''*'"" . ,v Bn.SJ Hun. T 11985) 

sJifi.rsSTFU989,.a---i";;)TS^^^^ 

n988) Human cytomegalovifub m^o .ofiware. Nucleic Acids 

StadenR (1986) The current status r „,otifs in sequences. CABIOS 4: 53-60 

S Je^ i^U^KUods to denne and .ocate patterns of .ot.fs 




Stannard LM (1989) β2 microglobulin binds to the tegument of cytomegalovirus: an immunogold study. J

Gen Virol 70: 2179-2184

Staprans SI, Spector DH (1986) 2.2-kilobase class of early transcripts encoded by cell-related sequences in

human cytomegalovirus strain AD169. J Virol 57: 591-602 
Stenberg RM, Thomsen DR, Stinski MF (1984) Structural analysis of the major immediate early gene of 

human cytomegalovirus. J Virol 49: 190-199
Stenberg RM, Witte PR, Stinski MF (1985) Multiple spliced and unspliced transcripts from human 

cytomegalovirus immediate-early region 2 and evidence for a common initiation site within 

immediate-early region 1. J Virol 56: 665-675
Stinski MF (1977) Synthesis of proteins and glycoproteins in cells infected with human cytomegalovirus. J 

Virol 23: 751-767

Stinski MF, Roehr TJ (1985) Activation of the major immediate early gene of human cytomegalovirus by

cis-acting elements in the promoter-regulatory sequence and by virus-specific trans-acting
components. J Virol 55: 431-441
Stinski MF, Thomsen DR, Stenberg RM, Goldstein LC (1983) Organization and expression of the 

immediate early genes of human cytomegalovirus. J Virol 46: 1-14 
Tamashiro JC, Filpula D, Friedmann T, Spector DH (1984) Structure of the heterogeneous L-S junction

region of human cytomegalovirus strain AD169 DNA. J Virol 52: 541-548
Thompson R, Honess RW, Taylor L, Morran J, Davison AJ (1987) Varicella-zoster virus specifies a

thymidylate synthetase. J Gen Virol 68: 1449-1455 
Thomsen DR, Stenberg RM, Goins WF, Stinski MF (1984) Promoter-regulatory region of the major

immediate early gene of human cytomegalovirus. Proc Natl Acad Sci USA 81: 659-663
Townsend A, Ohlen C, Bastin J, Ljunggren H-G, Foster L, Karre K (1989) Association of class I major

histocompatibility heavy and light chains induced by viral peptides. Nature 340: 443-448 
Trimble JJ, Murthy CS, Bakker A, Grassmann R, Desrosiers RC (1988) A gene for dihydrofolate

reductase in a herpesvirus. Science 239: 1145-1147 
Wang F, Petti L, Braun D, Seung S, Kieff E (1987) A bicistronic Epstein-Barr virus mRNA encodes two 

nuclear proteins in latently infected, growth-transformed lymphocytes. J Virol 61: 945-954
Wathen MW, Stinski MF (1982) Temporal patterns of human cytomegalovirus transcription: mapping

the viral RNAs synthesized at immediate early, early, and late times after infection. J Virol 41: 462-477

Weber PC, Challberg MD, Nelson NJ, Levine M, Glorioso JC (1988) Inversion events in the HSV-1 

genome are directly mediated by the viral DNA replication machinery and lack sequence specificity. 

Cell 54: 369-381

Weller SK, Aschman DP, Sacks WR, Coen DM, Schaffer PA (1983) Genetic analysis of temperature- 
sensitive mutants of HSV-1: the combined use of complementation and physical mapping for cistron

assignment. Virology 130: 290-305 
Weston K (1988) An enhancer element in the short unique region of human cytomegalovirus regulates the 

production of a group of abundant immediate early transcripts. Virology 162: 406-416
Weston K, Barrell BG (1986) Sequence of the short unique region, short repeats and part of the long

repeat of human cytomegalovirus. J Mol Biol 192: 177-208 
Whitton JL, Clements JB (1984) The junctions between the repetitive and the short unique sequences of 

the herpes simplex virus genome are determined by the polypeptide-coding regions of two spliced 

immediate-early mRNAs. J Gen Virol 65: 451-466 
Wilkinson GWG, Akrigg A, Greenaway PJ (1984) Transcription of the immediate early genes of human

cytomegalovirus strain AD169. Virus Res 1: 101-116
Worrad DM, Caradonna S (1988) Identification of the coding sequence for herpes simplex virus uracil-

DNA glycosylase. J Virol 62: 4774-4777
Wright DA, Staprans SI, Spector DH (1988) Four phosphoproteins with common amino termini are

encoded by human cytomegalovirus AD169. J Virol 62: 331-340 
Wu CA, Nelson NJ, McGeoch DJ, Challberg MD (1988) Identification of herpes simplex virus type 1

genes required for origin-dependent DNA synthesis. J Virol 62: 435-443 
Yang-Feng TL, Barton DE, Thelander L, Lewis WH, Srinivasan PR, Francke U (1987) Ribonucleotide 

reductase M2 subunit sequences mapped to four different chromosomal sites in humans and mice: 

functional locus identified by its amplification in hydroxyurea-resistant cell-lines. Genomics 1: 77-86
Zhang CX, Decaussin G, de Turenne Tessier M, Daillie J, Ooka T (1987) Identification of an Epstein-Barr

virus-specific deoxyribonuclease gene using complementary DNA. Nucleic Acids Res 15: 2707-2717



